Robust clause boundary identification for corpus annotation
نویسندگان
چکیده
The paper describes a rule-based system for tagging clause boundaries, implemented for annotating the Estonian Reference Corpus of the University of Tartu, a collection of written texts containing ca 245 million running words and available for querying via Keeleveeb language portal. The system needs information about parts of speech and grammatical categories coded in the word-forms, i.e. it takes morphologically annotated text as input, but requires no information about the syntactic structure of the sentence. Among the strong points of our system we should mention identifying parenthesis and embedded clauses, i.e. clauses that are inserted into another clause dividing it into two separate parts in the linear text, for example a relative clause following its head noun. That enables a corpus query system to unite the otherwise divided clause, a feature that usually presupposes full parsing. The overall precision of the system is 95% and the recall is 96%. If “ordinary” clause boundary detection and parenthesis and embedded clause boundary detection are evaluated separately, then one can say that detecting an “ordinary” clause boundary (recall 98%, precision 96%) is an easier task than detecting an embedded clause (recall 79%, precision 100%).
منابع مشابه
Automatic Clause Boundary Annotation in the Hindi Treebank
In this paper, we propose a method for automatic clause boundary annotation in the Hindi Dependency Treebank. We show that the clausal information implicitly encoded in a dependency structure can be made explicit with no or less human intervention. We exercised the proposed approach on 16,000 sentences of Hindi Dependency Treebank. Our approach gives an accuracy of 94.44% for clause boundary id...
متن کاملEnriching a Treebank to Investigate Relative Clause Extraposition in German
I describe the construction of a corpus for research on relative clause extraposition in German based on the treebank TüBa-D/Z. I also define an annotation scheme for the relations between relative clauses and their antecedents which is added as a second annotation level to the syntactic trees. This additional annotation level allows for a direct representation of the relevant parts of the rela...
متن کاملTowards an Annotation of Syntactic Structure in the Swedish Sign Language Corpus
This paper describes on-going work on extending the annotation of the Swedish Sign Language Corpus (SSLC) with a level of syntactic structure. The basic annotation of SSLC in ELAN consists of six tiers: four for sign glosses (two tiers for each signer; one for each of a signer’s hands), and two for written Swedish translations (one for each signer). In an additional step by Östling et al. (2015...
متن کاملSituation Entity Annotation
This paper presents an annotation scheme for a new semantic annotation task with relevance for analysis and computation at both the clause level and the discourse level. More specifically, we label the finite clauses of texts with the type of situation entity (e.g., eventualities, statements about kinds, or statements of belief) they introduce to the discourse, following and extending work by S...
متن کاملClause Boundary Identification for Malayalam Using CRF
This paper presents a clause boundary identification system for Malayalam sentences using the machine learning approach CRF (Conditional Random Field).Malayalam Language is considered as a 'Left branching language' where verbs are seen at the end of the sentence. Clause boundary identification plays a vital role in many NLP applications and for Malayalam language, the clause boundary identifica...
متن کامل